1 Introduction to the dataset

The dataset for this competition is a relational set of files describing customers’ orders over time. The goal of the competition is to predict which products will be in a user’s next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders. For more information, see the blog post accompanying its public release.

1.1 Load Packages
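
The loading code for this section is not shown. As a minimal sketch, the package list below is an assumption inferred from the functions called later in the notebook (`%>%` and `mutate`, `ggplot`, `wday`, `grid.newpage`, `fread`, `treemap`); it is not the author's original chunk. It attaches only the packages that are installed and reports the rest.

```r
# Assumed package list, inferred from the function calls used in this notebook
pkgs <- c("dplyr", "ggplot2", "lubridate", "grid", "data.table", "treemap")

# Check which of the assumed packages are not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```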

1.2 Function for multiple plot

# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
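
To make the layout matrix concrete, the sketch below reproduces the lookup that multiplot performs for each plot index, using the example layout from the comment above. It only uses base R; the variable name `cells` is introduced here for illustration.

```r
# Layout from the comment above: plots 1 and 2 on top, plot 3 across the bottom
layout <- matrix(c(1, 2, 3, 3), nrow = 2, byrow = TRUE)

# For each plot index, find the (row, col) cells it occupies,
# exactly as multiplot does with which(..., arr.ind = TRUE)
cells <- lapply(1:3, function(i) {
  as.data.frame(which(layout == i, arr.ind = TRUE))
})

cells[[1]]  # plot 1 occupies row 1, column 1
cells[[3]]  # plot 3 occupies row 2, columns 1 and 2 (the full bottom row)
```

Because plot 3 matches two cells, its viewport receives both column positions and the plot is stretched across them.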

1.3 Load files

aisles <- data.table::fread("aisles.csv")
departments <- data.table::fread("departments.csv")
order_prior <- data.table::fread("order_products__prior.csv")
order_train <- data.table::fread("order_products__train.csv")
orders <- data.table::fread("orders.csv")
products <- data.table::fread("products.csv")
sample_sub <- data.table::fread("sample_submission.csv")

1.4 Files glimpse

The list of aisles (aisles)

The list of departments (departments)

Example of a sample submission (sample_sub)

The list of orders (orders): ‘order_dow’ is the day of the week. User 1 has 11 orders, 1 of which is in the train set and 10 of which are prior orders.

The list of products (products)

order_prior: contains the contents of previous orders for all customers. ‘reordered’ indicates that the customer has a previous order that contains the product.

order_train: in each order, products were added to the cart by priority. Some products were reordered.

1.5 Reformatting Datasets

aisles <- aisles %>%
          mutate(aisle = as.factor(aisle))

departments <- departments %>%
               mutate(department = as.factor(department))


order_prior <- order_prior %>%
               mutate(reordered = as.logical(reordered)) #%>%
               #mutate(product_id = as.factor(product_id))

order_train <- order_train %>%
               mutate(reordered = as.logical(reordered))

orders <- orders %>%
          mutate(eval_set = as.factor(eval_set)) %>%
          # order_dow is coded 0:6 but wday() expects 1:7, so order_dow == 0 becomes NA here;
          # wday(order_dow + 1, label = TRUE) would label all seven days
          mutate(w_day = wday(order_dow , label = TRUE)) %>%
          mutate(user_id = as.factor(user_id))

products <- products %>%
            mutate(product_name = as.factor(product_name))
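
The w_day column is built from order_dow, which is coded 0 to 6. The base R sketch below illustrates why the code 0 ends up as NA when the values are labelled against levels 1 to 7, and how shifting by one avoids it. It uses `ordered()` as a stand-in for the weekday labelling, and it assumes that code 0 is the first day of the week; the data description does not appear to say which day 0 is, so the labels are illustrative only.

```r
# order_dow is coded 0..6; weekday labels are defined for levels 1..7
order_dow <- c(0, 1, 2, 6)
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Without a shift, 0 matches no level and becomes NA
no_shift <- ordered(order_dow, levels = 1:7, labels = day_labels)

# With a shift of +1, every code maps to a label
# (assumption: code 0 is the first day of the week)
shifted <- ordered(order_dow + 1, levels = 1:7, labels = day_labels)

print(no_shift)
print(shifted)
```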

2 Market Basket analysis

In this first section, we explore the details of the orders, the contents of the baskets, and the best-selling items.

2.1 Regroup items per basket

In each order, customers bought multiple items, forming a basket. We will regroup the orders per basket.

We need to join orders by order_id and then by product_id. This code takes a while to run.

# transactions <- orders %>%
#   left_join(order_prior, by = "order_id") %>%
#   left_join(products, by = "product_id")
# 
# baskets <- transactions %>%
#            plyr::ddply(c("order_id", "user_id"),
#               function(df1) paste(df1$product_name,
#                                   collapse = ","))
# 
# colnames(baskets) <- c("Order_id","user_id","Baskets")
# baskets <- readRDS("baskets.RDS")
# tibble::glimpse(baskets)
# saveRDS(object = baskets, file = "baskets.RDS")
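
Since the chunk above is commented out, here is a small self-contained sketch of the same regrouping idea on toy data. It uses base R `aggregate()` rather than `plyr::ddply`, so it is a swapped-in technique and not the author's code; the toy `transactions` table is hypothetical.

```r
# Toy stand-in for the joined orders / order_prior / products table
transactions <- data.frame(
  order_id = c(1, 1, 2, 2, 2),
  user_id = c(10, 10, 11, 11, 11),
  product_name = c("Banana", "Organic Whole Milk",
                   "Strawberries", "Banana", "Asparagus"),
  stringsAsFactors = FALSE
)

# Collapse the product names of each order into one comma-separated string
baskets <- aggregate(product_name ~ order_id + user_id,
                     data = transactions,
                     FUN = paste, collapse = ",")
colnames(baskets) <- c("Order_id", "user_id", "Baskets")

print(baskets)
```

Each row of `baskets` then holds one order with its full list of products, which is the shape needed for basket-level analysis.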

2.2 View the distribution of orders / transactions (hours and week days)

## time of ordering
p1 <- orders %>%
  ggplot(aes(x = order_hour_of_day)) +
    geom_histogram( stat="count", color= "blue") # ,bins = 24

## days of ordering
p2 <- orders %>%
  ggplot(aes(x = w_day)) +
  geom_histogram( stat= "count", color = "green")

## interval of days before Reordering
p3 <- orders %>%
  ggplot(aes(x = days_since_prior_order)) +
  geom_histogram(bins = 30, color = "yellow")


 # plot p1, p2, p3 in the same plot
 layout <- matrix(c(1,2,3,3), 2, 2, byrow = TRUE)
 multiplot(p1, p2, p3, layout=layout)

We find: Most orders are placed during the working hours of the day (8:00 to 17:00). We assume that NA corresponds to Saturday. The number of orders changes clearly over the weekend. Reordering follows an interval of up to 30 days, with peaks at days 7, 15 and 30. Over a month, more orders are placed at weekends, but the number of orders also differs from one weekend to the next.

2.3 Plot the number of prior, train, and test orders

## count the number of prior orders
p3 <- orders %>%
  filter(eval_set == 'prior') %>%
  ggplot(aes(order_number)) +
  geom_histogram(stat = "count", color = "red")

p2 <- orders %>%
  filter(eval_set == 'train') %>%
  ggplot(aes(order_number)) +
  geom_histogram(stat = "count", color = "green")

p1 <- orders %>%
  filter(eval_set == 'test') %>%
  ggplot(aes(order_number)) +
  geom_histogram(stat = "count", color = "blue")

 # plot P1, P2, P3 in the same plot
 layout <- matrix(c(1,2,3,3), 2, 2, byrow = TRUE)
 multiplot(p1, p2, p3, layout=layout)

There are more prior orders (200000) than train orders (15000), and more train orders than test orders (7500). We observe a peak at order number 100 for the test and train samples.

2.4 plot the number of items per order

prior <- order_prior %>%
  group_by(order_id) %>%
  dplyr::summarise(n_orders = n()) %>%
  ggplot(aes(x= n_orders)) +
  geom_histogram(bins = 50, color = "yellow")+
  xlim(0,50) +
  labs(title = "Prior orders") +
  xlab("number of items per order") +
  ylab("n° orders")
  
train <- order_train %>%
  group_by(order_id) %>%
  dplyr::summarise(n_orders = n()) %>%
  ggplot(aes(x= n_orders)) +
  geom_histogram(bins = 50, color = "orange") +
  xlim(0,50) +
  labs(title = "Train orders") +
  xlab("number of items per order") +
  ylab("n° orders")

 # plot prior and train in the same plot
 layout <- matrix(c(1, 2), 1, 2, byrow = TRUE)
 multiplot(prior, train, layout=layout)

We find: The most frequent order size is about 5-6 items for both the Prior and Train datasets. Ordering 25 items seems to be an exception.

2.5 Top customers for Top products

# top customers by number of orders placed (one row of `orders` per order)
# the dataset is limited to a maximum of 100 orders per customer
top_costumers_items <- orders %>%
  group_by(user_id) %>%
  dplyr::summarise(n_orders = length(order_id)) %>%
  filter(n_orders < 100) %>%
  top_n(50, wt = n_orders) %>%
  #arrange(freq) %>%
  ggplot(aes(x = reorder(user_id, - n_orders), y = n_orders)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_point() +
  labs(title = "Top customers by number of orders") +
  xlab("user_id") +
  ylab("nbr orders")


top_costumers_visits <- orders %>%
  #filter(user_id == 123) %>%
  group_by(user_id) %>%
  dplyr::summarise(n_visits = last(order_number)) %>%
  filter(n_visits < 100) %>%
  top_n(20, n_visits) %>%
  ggplot(aes(x = user_id, y = n_visits)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col() +
  labs(title = "Number of visits per customer") +
  xlab("User_id") +
  ylab("Order Number")
  

 top20_item_prior <- order_prior %>%
   group_by(product_id) %>%
   dplyr::summarise(n = n()) %>%
  top_n(20, wt = n) %>%
  left_join(products, by = 'product_id') %>%
  ggplot(aes(x = reorder(product_name, - n) , y = n )) +
  #coord_flip() +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col( color = "green") +
  labs(title = "Top 20 products for Prior orders") +
  xlab("product_id") +
  ylab("n° orders")


  top20_item_train <- order_train %>%
  group_by(product_id) %>%
   dplyr::summarise(n = n()) %>%
  top_n(20, wt = n) %>%
  left_join(products, by = 'product_id') %>%
  ggplot(aes(x = reorder(product_name, - n) , y = n )) +
  #coord_flip() +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col( color = "red") +
  labs(title = "Top 20 products for train orders") +
  xlab("product_id") +
  ylab("nbr orders")
  
  
  # plot the four panels in the same plot
 layout <- matrix(c(1,2,3,4), 2, 2, byrow = TRUE)
 multiplot(top_costumers_items, top_costumers_visits, top20_item_prior, top20_item_train, layout=layout)

2.6 Top reordered items

reordered_train <- order_train %>% 
  filter(reordered == TRUE) %>%
  group_by(product_id) %>% 
  dplyr::summarise(freq = n()) %>% 
  top_n(10, wt = freq) %>%
  left_join(products, by = 'product_id') %>%
    ggplot(aes(x = reorder(product_name, - freq) , y = freq )) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col( color = "red") +
  labs(title = "Top reordered items for train dataset") +
  xlab("product name") +
  ylab("Reordered frequency")
  
  
reordered_prior <- order_prior %>% 
  filter(reordered == TRUE) %>%
  group_by(product_id) %>% 
  dplyr::summarise(freq = n()) %>% 
  top_n(10, wt = freq) %>%
  left_join(products, by = 'product_id') %>%
    ggplot(aes(x = reorder(product_name, - freq) , y = freq )) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col( color = "blue") +
  labs(title = "Top reordered items for prior dataset") +
  xlab("product name") +
  ylab("Reordered frequency")


  # plot both panels in the same plot
 layout <- matrix(c(1,2), 1, 2, byrow = TRUE)
 multiplot(reordered_train, reordered_prior, layout=layout)

We find: The main difference is in Organic Whole Milk which is not in the top 10 products in train orders.

2.7 Which items are most often added first to the cart

first_item_to_cart_prior <-
order_prior %>%
  filter(add_to_cart_order == 1) %>%
  #filter(product_id == "345") %>%
  group_by(product_id, reordered) %>%
  dplyr::summarise(n_first = n()) %>%
  arrange(desc(n_first)) %>%
  head(10) %>%
 # dplyr::top_n(10, wt = n_first) doesn't work here: the data are still grouped by
 # product_id after summarise(), so top_n() would rank rows within each product
  left_join(products, by = 'product_id') %>%
  ggplot(aes(x = product_name, y =  n_first)) +
   theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col(aes(fill = reordered)) +
  labs(title = "Top first added items to the cart (Prior)") +
  xlab("product name") +
  ylab("number of times added first to the cart")


first_item_to_cart_train <-
order_train %>%
  filter(add_to_cart_order == 1) %>%
  group_by(product_id, reordered) %>%
  #summarize(proportion_reordered = mean(reordered), n=n())
  dplyr::summarise(n_first = n()) %>%
  arrange(desc(n_first)) %>%
  head(10) %>%
 # dplyr::top_n(10, wt = n_first) doesn't work here: the data are still grouped by
 # product_id after summarise(), so top_n() would rank rows within each product
  left_join(products, by = 'product_id') %>%
  ggplot(aes(x = product_name, y =  n_first)) +
   theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col(aes(fill = reordered)) +
  labs(title = "Top first added items to the cart (Train)") +
  xlab("product name") +
  ylab("number of times added first to the cart")



  # plot both panels in the same plot
 layout <- matrix(c(1,2), 1, 2, byrow = TRUE)
 multiplot(first_item_to_cart_prior, first_item_to_cart_train, layout=layout)

We find: In general, only reordered items are added at the beginning of shopping. Only in a few cases, with Banana, is an item that is ordered for the first time also added first to the cart.

2.8 Reordering rate versus add-to-cart rank

order_train %>% 
  group_by(product_id, add_to_cart_order) %>% 
  summarize(n_items_basket = n(), reordered_rate_basket = mean(reordered)) %>% 
  group_by(add_to_cart_order) %>%
  summarise(reordered_rate_all = mean(reordered_rate_basket))%>%
  ggplot(aes(x= add_to_cart_order, y = reordered_rate_all)) +
  geom_line()

2.9 Which items have the highest percentage of being added first to the cart

first_pct_item_to_cart_train <- order_train %>% 
  group_by(product_id, add_to_cart_order) %>% 
  summarize(count = n()) %>% 
  mutate(pct=count/sum(count)) %>% 
  filter(add_to_cart_order == 1, count>10) %>% 
  arrange(desc(pct)) %>% 
  left_join(products,by='product_id') %>%
  ungroup() %>% 
  select(product_name, pct, count) %>% 
  top_n(10, wt=pct) %>%
  ggplot(aes(x = reorder(product_name,-pct), y = pct)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col() +
   labs(title = "Top first % added items to the cart (Train)") +
  xlab("Product name") +
  ylab("Percentage (%) of items added first to the cart")



first_pct_item_to_cart_prior <- order_prior %>% 
  group_by(product_id, add_to_cart_order) %>% 
  summarize(count = n()) %>% 
  mutate(pct=count/sum(count)) %>% 
  filter(add_to_cart_order == 1, count>10) %>% 
  arrange(desc(pct)) %>% 
  left_join(products,by='product_id') %>%
  ungroup() %>% 
  select(product_name, pct, count) %>% 
  top_n(10, wt=pct) %>%
  ggplot(aes(x = reorder(product_name,-pct), y = pct)) +
   geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)+
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col() +
   labs(title = "Top first % added items to the cart (Prior)") +
  xlab("Product name") +
  ylab("Percentage (%) of items added first to the cart")

  # plot both panels in the same plot
 layout <- matrix(c(1,2), 1, 2, byrow = TRUE)
 multiplot(first_pct_item_to_cart_prior, first_pct_item_to_cart_train, layout=layout)
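
To clarify what the pct column measures, the base R sketch below computes the same quantity on toy data: for each product, the share of its order lines in which it was the first item added to the cart. The toy table is hypothetical, and `tapply()` is used here in place of the dplyr pipeline above.

```r
# Toy order lines: product 1 appears 3 times, product 2 appears 4 times
toy <- data.frame(
  product_id = c(1, 1, 1, 2, 2, 2, 2),
  add_to_cart_order = c(1, 1, 2, 1, 2, 3, 3)
)

# Share of each product's order lines where it was added first
pct_first <- tapply(toy$add_to_cart_order == 1, toy$product_id, mean)

print(pct_first)
```

Product 1 is added first in 2 of its 3 lines and product 2 in 1 of its 4 lines, so a product can rank high on pct with few total orders. This is why the pipeline above also filters on count > 10.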

2.10 Top90 sellers of Banana and Strawberries

banana_id <- products %>%
             filter(product_name == "Banana") %>%
             #select(product_id) %>%
            .$product_id
strawberries_id <- products %>%
             filter(product_name == "Strawberries") %>%
             #select(product_id) %>%
            .$product_id

spring_water_id <- products %>%
             filter(product_name == "Spring Water") %>%
             #select(product_id) %>%
            .$product_id

asparagus_id <- products %>%
             filter(product_name == "Asparagus") %>%
             #select(product_id) %>%
            .$product_id

# filter the orders with banana or strawberries
order_train %>%
  filter(product_id %in% c(banana_id, strawberries_id)) %>%
  left_join(orders, by = "order_id") %>%
  group_by( order_id, user_id) %>%
   dplyr::summarise(n_orders = last(order_number)) %>%
  filter(n_orders == 90) 
## # A tibble: 6 x 3
## # Groups:   order_id [6]
##   order_id user_id n_orders
##      <int> <fct>      <int>
## 1   271953 145628        90
## 2  1578927 24195         90
## 3  1589791 188446        90
## 4  1854209 166449        90
## 5  2803296 179192        90
## 6  2975947 195993        90

2.11 Explore Days interval of reordering items

interval_item_reordered_train <- order_train %>%
  left_join(orders, by = "order_id") %>%
  group_by(days_since_prior_order) %>%
  summarize(mean_reorder = mean(reordered), n = n()) %>%
  ggplot(aes(days_since_prior_order, y = mean_reorder)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col( color = "red") +
  labs(title = "Mean Interval (days) of reordered items (Train)") +
  xlab("Days") +
  ylab("Mean reordered (%)")


interval_item_reordered_prior <- order_prior %>%
  left_join(orders, by = "order_id") %>%
  group_by(days_since_prior_order) %>%
  summarize(mean_reorder = mean(reordered), n = n()) %>%
  ggplot(aes(days_since_prior_order, y = mean_reorder)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_col( color = "blue") +
  labs(title = "Mean Interval (days) of reordered items (Prior)") +
  xlab("Days") +
  ylab("Mean reordered (%)")


  # plot both panels in the same plot
 layout <- matrix(c(1,2), 1, 2, byrow = TRUE)
 multiplot(interval_item_reordered_train, interval_item_reordered_prior, layout=layout)

For orders placed 0 days after the previous one, the mean reorder rate is about 0.85 (85%). After 30 days, the mean reorder rate drops to about 0.45 (45%).

2.12 Explore number of orders and reordering items

grp_pdt_train <- order_train %>%
  #left_join(orders, by = "order_id") %>%
  group_by(product_id) %>%
  summarize(mean_reorder = mean(reordered), n = n()) %>%
  ggplot(aes(n, y = mean_reorder)) +
  #theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_point(size = 0.1, alpha = 0.3) +
  geom_smooth(color="red")+
  labs(title = "Number of reordered items per product_id (Train)") +
  xlab("Number of orders") +
  ylab("Mean reordered (%) per product_id") +
  coord_cartesian(xlim=c(0,5000))

grp_order_train <- order_train %>%
  #left_join(orders, by = "order_id") %>%
  group_by(order_id) %>%
  summarize(mean_reorder = mean(reordered), n = n()) %>%
  ggplot(aes(n, y = mean_reorder)) +
  #theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_point(color= "red", size = 0.1, alpha = 0.3) +
  labs(title = "Number of reordered items per order_id (Train)") +
  xlab("Number of orders") +
  ylab("Mean reordered (%) per order_id") 

grp_pdt_prior <- order_prior %>%
  #left_join(orders, by = "order_id") %>%
  group_by(product_id) %>%
  summarize(mean_reorder = mean(reordered), n = n()) %>%
  ggplot(aes(n, y = mean_reorder)) +
  #theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_point( size = 0.1, alpha = 0.3) +
  geom_smooth(color="blue")+
  labs(title = "Number of reordered items per product_id (Prior)") +
  xlab("Number of orders") +
  ylab("Mean reordered (%) per product_id") +
  coord_cartesian(xlim=c(0,10000))

grp_order_prior <- order_prior %>%
  #left_join(orders, by = "order_id") %>%
  group_by(order_id) %>%
  summarize(mean_reorder = mean(reordered), n = n()) %>%
  ggplot(aes(n, y = mean_reorder)) +
  #theme(axis.text.x = element_text(angle=45, hjust=1)) +
  geom_point(color="blue", size = 0.1, alpha = 0.3) +
  labs(title = "Number of reordered items per order_id (Prior)") +
  xlab("Number of orders") +
  ylab("Mean reordered (%) per order_id") 

  # plot P1, P2, P3, p4 in the same plot
 layout <- matrix(c(1,2, 3, 4), 2, 2, byrow = TRUE)
 multiplot(grp_pdt_train, grp_order_train,grp_pdt_prior, grp_order_prior , layout=layout)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

2.13 Visualize Departments and Aisles

library(treemap)

items_per_aisle <- products %>% 
  group_by(department_id, aisle_id) %>% 
  summarize(n_items = n()) %>%
  left_join(departments,by="department_id") %>%
  left_join(aisles,by="aisle_id")

tree_aisle <- order_train %>%
  group_by(product_id) %>%
  dplyr::summarise(count_ordered_item = n()) %>%
  left_join(products, by = "product_id") %>%
  ungroup() %>%
  group_by(department_id, aisle_id) %>%
  summarize(sumcount = sum(count_ordered_item)) %>%
  left_join(items_per_aisle, by = c("department_id", "aisle_id")) %>% 
  mutate(onesize = 1)

treemap(tree_aisle,index = c("department","aisle"),
             vSize = "onesize", 
             vColor = "department",
             palette = "Set3",
             title = "super market Map", 
             sortID = "-sumcount",
             border.col = "#FFFFFF",
             type = "categorical",
             fontsize.legend = 0,
             bg.labels = "#FFFFFF")

treemap(tree_aisle,index = c("department","aisle"),
             vSize = "sumcount", 
             vColor = "department",
             palette = "Set3",
             title = "super market Map", 
           #  sortID = "-sumcount",
             border.col = "#FFFFFF",
             type = "categorical",
             fontsize.legend = 0,
             bg.labels = "#FFFFFF")

2.14 Look for the proportion of customers that reorder the same products

order_number_upper2 <- order_prior %>%
  group_by(order_id) %>%
  dplyr::summarise(mean_redordered_item_per_basket = mean(reordered), n_items_per_basket = n()) %>%
  left_join(orders, by = "order_id") %>%
  filter(order_number > 2)

order_number_upper2 %>%
  #filter(eval_set =="prior") %>% 
  group_by(user_id) %>%
  dplyr::summarise(sum_orders = sum(mean_redordered_item_per_basket == 1, na.rm = TRUE), ratio_reordered_items = sum_orders/n()) %>%
  filter(ratio_reordered_items == 1) %>%
  arrange(desc(sum_orders)) %>%
  head(10)
## # A tibble: 10 x 3
##    user_id sum_orders ratio_reordered_items
##    <fct>        <int>                 <dbl>
##  1 99753           97                     1
##  2 55331           49                     1
##  3 106510          49                     1
##  4 111365          47                     1
##  5 74656           46                     1
##  6 170174          45                     1
##  7 12025           43                     1
##  8 164779          37                     1
##  9 37075           34                     1
## 10 110225          33                     1

Here I look for customers who reorder the same products every time. To find them, I look at all orders after the first ones (the code keeps order_number > 2) in which the proportion of reordered items is exactly 1 (this can easily be adapted to more lenient thresholds). There are in fact 3,487 customers who always reorder the same products. user_id 99753 reordered the same items (same basket contents) for 97 visits/orders.
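
As a sketch of how the strict criterion could be relaxed, the base R example below flags users whose baskets all reach a chosen reorder share. The toy data, the helper name `loyal_users` and the threshold value 0.5 are hypothetical and only illustrate the idea; they are not part of the analysis above.

```r
# Toy order lines for two users, two orders each
items <- data.frame(
  user_id   = c(1, 1, 1, 1, 2, 2, 2, 2),
  order_id  = c(11, 11, 12, 12, 21, 21, 22, 22),
  reordered = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
)

# Share of reordered items in each basket
per_order <- aggregate(reordered ~ user_id + order_id, data = items, FUN = mean)

# Users for whom every basket has a reorder share >= threshold
loyal_users <- function(per_order, threshold) {
  meets <- per_order$reordered >= threshold
  share <- tapply(meets, per_order$user_id, mean)
  as.numeric(names(share)[share == 1])
}

loyal_users(per_order, 1)    # strict: only fully reordered baskets count
loyal_users(per_order, 0.5)  # lenient: at least half of each basket reordered
```

With the strict threshold only user 1 qualifies; with the lenient one, user 2 also qualifies because the basket with one new item still reaches a reorder share of 0.5.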

2.15 The basket of the most loyal user_id

order_number_upper2 %>%
filter(user_id == 99753) %>%
  left_join(order_prior, by = "order_id") %>%
  left_join(products, by = "product_id") %>%
  select(product_name, user_id, order_id, w_day, days_since_prior_order) %>%
  arrange(order_id) %>%
  head(10)
## # A tibble: 10 x 5
##    product_name             user_id order_id w_day days_since_prior_order
##    <fct>                    <fct>      <int> <ord>                  <dbl>
##  1 Organic Whole Milk       99753      46614 Tue                        2
##  2 Organic Reduced Fat Milk 99753      46614 Tue                        2
##  3 Organic Whole Milk       99753      67223 Wed                        2
##  4 Organic Reduced Fat Milk 99753      67223 Wed                        2
##  5 Organic Whole Milk       99753     214506 Sun                        5
##  6 Organic Reduced Fat Milk 99753     214506 Sun                        5
##  7 Organic Whole Milk       99753     240832 Tue                        2
##  8 Organic Reduced Fat Milk 99753     240832 Tue                        2
##  9 Organic Whole Milk       99753     260804 Sun                        4
## 10 Organic Reduced Fat Milk 99753     260804 Sun                        4

This user always buys the same two items, Organic Whole Milk and Organic Reduced Fat Milk, maybe for a baby.

orders %>%
   left_join(sample_sub, "order_id") %>%
  head(20)
##    order_id user_id eval_set order_number order_dow order_hour_of_day
## 1   2539329       1    prior            1         2                 8
## 2   2398795       1    prior            2         3                 7
## 3    473747       1    prior            3         3                12
## 4   2254736       1    prior            4         4                 7
## 5    431534       1    prior            5         4                15
## 6   3367565       1    prior            6         2                 7
## 7    550135       1    prior            7         1                 9
## 8   3108588       1    prior            8         1                14
## 9   2295261       1    prior            9         1                16
## 10  2550362       1    prior           10         4                 8
## 11  1187899       1    train           11         4                 8
## 12  2168274       2    prior            1         2                11
## 13  1501582       2    prior            2         5                10
## 14  1901567       2    prior            3         1                10
## 15   738281       2    prior            4         2                10
## 16  1673511       2    prior            5         3                11
## 17  1199898       2    prior            6         2                 9
## 18  3194192       2    prior            7         2                12
## 19   788338       2    prior            8         1                15
## 20  1718559       2    prior            9         2                 9
##    days_since_prior_order w_day products
## 1                      NA   Mon     <NA>
## 2                      15   Tue     <NA>
## 3                      21   Tue     <NA>
## 4                      29   Wed     <NA>
## 5                      28   Wed     <NA>
## 6                      19   Mon     <NA>
## 7                      20   Sun     <NA>
## 8                      14   Sun     <NA>
## 9                       0   Sun     <NA>
## 10                     30   Wed     <NA>
## 11                     14   Wed     <NA>
## 12                     NA   Mon     <NA>
## 13                     10   Thu     <NA>
## 14                      3   Sun     <NA>
## 15                      8   Mon     <NA>
## 16                      8   Tue     <NA>
## 17                     13   Mon     <NA>
## 18                     14   Mon     <NA>
## 19                     27   Sun     <NA>
## 20                      8   Mon     <NA>
orders %>%
   inner_join(sample_sub, "order_id") %>%
   head(10)
##    order_id user_id eval_set order_number order_dow order_hour_of_day
## 1   2774568       3     test           13         5                15
## 2    329954       4     test            6         3                12
## 3   1528013       6     test            4         3                16
## 4   1376945      11     test            8         6                11
## 5   1356845      12     test            6         1                20
## 6   2161313      15     test           23         1                 9
## 7   1416320      16     test            7         0                13
## 8   1735923      19     test           10         6                17
## 9   1980631      20     test            5         1                11
## 10   139655      22     test           16         5                 6
##    days_since_prior_order w_day    products
## 1                      11   Thu 39276 29259
## 2                      30   Tue 39276 29259
## 3                      22   Tue 39276 29259
## 4                       8   Fri 39276 29259
## 5                      30   Sun 39276 29259
## 6                       7   Sun 39276 29259
## 7                       7  <NA> 39276 29259
## 8                       8   Fri 39276 29259
## 9                      30   Sun 39276 29259
## 10                      1   Thu 39276 29259